Assignment Brief

Exercises

This homework focuses on training and evaluating prediction models for a particular problem and dataset. The data comes from the Centers for Disease Control and Prevention (CDC: https://covid.cdc.gov/covid-data-tracker/). The CDC is a US health protection agency in charge of collecting data about the COVID-19 pandemic, and in particular tracking cases, deaths, and trends of COVID-19 in the United States. The CDC collects and makes public deidentified individual-case data on a daily basis, submitted using standardized case reporting forms. In this analysis, we focus on using the data collected by the CDC to build a data analytics solution for death risk prediction.

The dataset we work with is a sample of the public data released by CDC, where the outcome for the target feature death_yn is known (i.e., either 'yes' or 'no'): https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf

The goal in this homework is to work with the data to build and evaluate prediction models that capture the relationship between the descriptive features and the target feature death_yn. For this homework you are asked to use the same dataset allocated to you in Homework1 (you can use your cleaned/prepared CSV from Homework1 or start from the raw dataset, clean it according to concepts covered in the lectures/labs, then use it for training prediction models).

There are 5 parts for this homework. Each part has an indicative maximum percentage given in brackets, e.g., part (1) has a maximum of 25% shown as [25]. The total that can be achieved is 100.

(1). [25] Data Understanding and Preparation: Exploring relationships between feature pairs and selecting/transforming promising features based on a given training set.

- (1.1) Randomly shuffle the rows of your dataset and split the dataset into two datasets: 70% training and 30% test. Keep the test set aside. For shuffling, please remember to set the random state so the split is always the same; this helps with reproducing and verifying your results.
- (1.2) On the training set:
    - Plot the correlations between all the continuous features (if any). Discuss what you observe in these plots.
    - For each continuous feature, plot its interaction with the target feature (a plot for each pair of continuous feature and target feature). Discuss what you observe from these plots, e.g., which continuous features seem to be better at predicting the target feature? Choose a subset of continuous features you find promising (if any). Justify your choices.
    - For each categorical feature, plot its pairwise interaction with the target feature. Discuss what knowledge you gain from these plots, e.g., which categorical features seem to be better at predicting the target feature? Choose a subset of categorical features you find promising (if any). Justify your choices.
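The reproducible split in (1.1) can be sketched as follows; this is a minimal illustration on toy data, assuming pandas and scikit-learn's train_test_split (the column names are placeholders, not the full CDC schema):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned CDC sample (column names assumed from the brief).
df = pd.DataFrame({
    "age_group": ["0 - 9 Years", "80+ Years", "40 - 49 Years", "20 - 29 Years"] * 25,
    "death_yn":  ["No", "Yes", "No", "No"] * 25,
})

# (1.1) Shuffle and split 70/30; fixing random_state makes the split reproducible.
train, test = train_test_split(df, test_size=0.3, shuffle=True, random_state=42)

# Re-running with the same random_state yields the exact same partition.
train2, _ = train_test_split(df, test_size=0.3, shuffle=True, random_state=42)
assert train.index.equals(train2.index)

print(len(train), len(test))
```

Because random_state is fixed, re-running the notebook reproduces the identical 70/30 partition, which is what makes the later evaluation comparable across runs.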


(2). [15] Predictive Modeling: Linear Regression.

- (2.1) On the training set, train a linear regression model to predict the target feature, using only the descriptive features selected in exercise (1) above.
- (2.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model by analysing each coefficient and how it relates each input feature to the target feature).    
- (2.3) Print the predicted target feature value for the first 10 training examples. Threshold the predicted target feature value given by the linear regression model at 0.5, to get the predicted class for each example. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
- (2.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained on the training (70%) dataset. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated random train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.

(3). [15] Predictive Modeling: Logistic Regression.

- (3.1) On the training set, train a logistic regression model to predict the target feature, using the descriptive features selected in exercise (1) above.   
- (3.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model).    
- (3.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
- (3.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.


(4). [20] Predictive Modeling: Random Forest.

- (4.1) On the training set, train a random forest model to predict the target feature, using the descriptive features selected in exercise (1) above.
- (4.2) Can you interpret the random forest model? Discuss any knowledge you can gain in regard to the working of this model.
- (4.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.
- (4.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and to the out-of-sample error and discuss your findings.

(5). [25] Improving Predictive Models.

- (5.1) Which model of the ones trained above performs better at predicting the target feature? Is it more accurate than a simple model that always predicts the majority class (i.e., if 'no' is the majority class in your dataset, the simple model always predicts 'no' for the target feature)? Justify your answers.
- (5.2) Summarise your understanding of the problem and of your predictive modeling results so far. Can you think of any new ideas to improve the best model so far (e.g., by using further data prep such as: feature selection, feature re-scaling, creating new features, combining models, or using other knowledge)? Please show how your ideas actually work in practice, by training and evaluating your proposed models. Summarise your findings so far.
- (5.3) Take your best model trained and selected based on past data (i.e., your cleaned Homework1 dataset), and evaluate it on the new test dataset provided with this homework (in file '24032021-covid19-cdc-deathyn-recent-10k.csv'). Discuss your findings.

Data Analytics - Modelling

Audit

Author: ARyan - 14395076

Module: COMP47350

DC: 2021-02-08

DLM: 2021-04-28

Desc: This file builds upon my analysis of the COVID19 data set and produces models to predict death.

Dict: The Data Dictionary for the Data Set is available at: https://www.cdc.gov/coronavirus/2019-ncov/downloads/data-dictionary.pdf

Table of Contents

Homework 1:

  1. Introduction

  2. Data Quality Report

  3. Data Quality Plan

  4. Extension Commentary and Analysis

Homework 2:

  1. Exploratory Analysis

  2. Model Creation and Analysis

  3. Model Extension and Refinement


00. Introduction

00.01 Background

COVID-19 is an infectious disease caused by SARS-CoV-2, a coronavirus strain first identified in December 2019 following an outbreak in the Chinese city of Wuhan; the WHO declared the outbreak a global pandemic in March 2020.

Since its discovery, health organisations have been actively gathering data to assess aspects of the disease, including infectivity, symptoms, and mortality rate. Active interest has been paid to factors which may increase a patient's risk of serious symptoms or death.

In this analysis, we focus on using the data collected by the CDC to build an analytics solution for predicting a patient's death risk. The CDC collects demographic characteristics, exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and comorbidities. The data also includes information on whether the individual survived.

00.02 Problem Scope

We wish to develop a model to predict the risk of a patient dying based on various metrics collected by the CDC.

00.03 Data

The CDC collects demographic data, exposure history, disease severity indicators, outcomes, clinical data, comorbidities, and whether the patient survived. The full data dictionary provided by the CDC is available at the following location: https://www.cdc.gov/coronavirus/2019-ncov/downloads/data-dictionary.pdf

For this assignment, a sample of ten thousand rows is provided from the full dataset available at: https://covid.cdc.gov/covid-data-tracker/

00.04 Approach

The assignment was broadly approached as follows, though the boundaries between steps were not hard:

  1. Exploratory Data Analysis - Investigating the Data Set Provided.
  2. Data Quality Report - Investigating the Sample Provided for Data Quality Issues.
  3. Data Quality Plan - Developing a plan to address and action issues with data quality.
  4. Extending Data - Adding additional features.
  5. Exploratory Analysis - Comparing the relationships between key feature pairs for our train and test split.
  6. Test Train Split - Splitting the data into a training and test set.
  7. Exploratory Analysis of Training Data - Exploring Feature Relations.
  8. Linear Model - Creating a Linear Model for this classification problem.
  9. Logistic Model - Creating a logistic model for this problem.
  10. Random Forest - Creating a Random Forest for this problem.
  11. Improving Models - Refining existing models to improve accuracy.

00.05 Limitation

As requested in the exercise, the key findings are prepared within the Notebook File and accompanying PDFs.

1. Data Quality Report

0. Background

COVID-19 is an infectious disease caused by SARS-CoV-2, a coronavirus strain first identified in December 2019 following an outbreak in the Chinese city of Wuhan; the WHO declared the outbreak a global pandemic in March 2020.

Since its discovery, health organisations have been actively gathering data to assess aspects of the disease, including infectivity, symptoms, and mortality rate. Active interest has been paid to factors which may increase a patient's risk of serious symptoms or death.

In this analysis, we focus on using the data collected by the CDC to build an analytics solution for predicting a patient's death risk. The CDC collects demographic characteristics, exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and comorbidities. The data also includes information on whether the individual survived.

1. Overview

This report will outline the initial findings based on the provided sample of the CDC dataset. It will summarise the data, describe the various data quality issues observed and how they will be addressed.

Appendix includes terminology, assumptions, explanations and summary of changes made to the original dataset. This also includes feature summaries and boxplots used to visualise the data.

2. Summary

The following are the key points in relation to the data set and approach:

3. Logical Integrity

As the dataset has a heavy focus on categorical data, the following tests were carried out to assess the integrity of the dataset:

4. Non-Datetime Categorical Features

There are 8 non-datetime categorical features in the dataset:

5. Datetime Categorical Features

There are 4 categorical datetime features in the dataset:

6. Boxplots

Boxplots were produced for all categorical data. These are present in the appendix due to the size of the file. All feature pairs and single-feature summaries were calculated as an initial exploration.

7. Note:

The steps provided in the assignment outline suggest a broadly linear process; however, upon reviewing the data, I did not believe the outlined process was particularly suitable for this dataset.

In particular, the processing steps outlined suggest the removal of duplicate values prior to data exploration. As I did not believe the records were, in fact, duplicates, but were instead driven by other elements, it was more reasonable to explore the relationships between various factors before taking any steps to drop overlapping rows, in order to better understand why the overlap occurs.

Similarly, the steps provided suggest not adding columns until the final section. Due to the nature of the data and the variety of missing values within some of the indicator and date columns, it seemed to me that valuable information could be obtained from my initial exploration before any final removal occurs. In particular, the onset datetime column looks to have key value in relation to the asymptomatic prevalence of COVID and the time between initial presentation and symptom onset. Adjusting the nature of this column, and adding attributes which reflect the data in the original column while preserving and enhancing the dataset, was therefore a logical approach before simply dropping this feature for its missing-value prevalence.

Similarly, the race column contains race and ethnicity combined; it can be replaced with the racial information alone, as that is sufficient to capture the concatenated field. While the CDC may need, for reporting purposes, to compare Hispanic vs non-Hispanic demographics, stripping the redundant information reduces the memory usage of the field while still allowing recovery if that comparison would be insightful.
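The race/ethnicity split described above could be sketched as follows; the value format ("White, Non-Hispanic") is an assumption, and the real strings in the CDC extract may differ:

```python
import pandas as pd

# Hypothetical values; the real race_ethnicity_combined strings may differ.
combined = pd.Series([
    "White, Non-Hispanic",
    "Black, Non-Hispanic",
    "Hispanic/Latino",
    None,
], name="race_ethnicity_combined")

# Keep only the race component before the first comma; missing stays missing.
race = combined.str.split(",", n=1).str[0].str.strip()
print(race.tolist())
```

Splitting on only the first comma keeps single-component values such as "Hispanic/Latino" intact, and missing values propagate through the string accessors unchanged.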

Due to all of the above, the data quality plan and the data quality actions were, in a sense, completed as a joint process, as proper cleansing of the dataset did not allow for a fully linear workflow. These steps are detailed below.

2. Data Quality Plan

Based on the initial insights, the following is the data quality plan. Full details on reasoning have been already outlined in the data quality report.

A key note: the author wishes to avoid dropping data as an intermediate step unless it is necessary or the data is directly contradictory. The acquisition cost of data is too significant to justify dropping it until a step just prior to its use in ML models, as retrieval can be challenging. As such, missing values are, in general, imputed. The data action dictionary is:

data_action_dictionary = {
    'cdc_case_earliest_dt': {
        "Data Quality Issues":  "515 rows where this is not the minimum of the other populated dates",
        "Data Quality Actions": "Confirm reason; otherwise leave as-is",
    },
    'cdc_report_dt': {
        "Data Quality Issues":  "Deprecated",
        "Data Quality Actions": "Drop",
    },
    'pos_spec_dt': {
        "Data Quality Issues":  "72% of data missing",
        "Data Quality Actions": "Drop after using for status correction",
    },
    'onset_dt': {
        "Data Quality Issues":  "49% of data missing. <1% of dates where onset_dt is too far after the case date",
        "Data Quality Actions": "Split into days since symptom onset. Flag missing data. Drop column. Statistically relevant. Enquire why some values are so extreme after the earliest date",
    },
    'current_status': {
        "Data Quality Issues":  "Probable Cases that should be Laboratory Confirmed Cases",
        "Data Quality Actions": "Update instances",
    },
    'sex': {
        "Data Quality Issues":  "Missing and Unknown flags",
        "Data Quality Actions": "Bin into Unknown category",
    },
    'age_group': {
        "Data Quality Issues":  "Missing and Unknown flags",
        "Data Quality Actions": "Bin into groups",
    },
    'race_ethnicity_combined': {
        "Data Quality Issues":  "Concatenated field. Race sufficient to capture all info",
        "Data Quality Actions": "Split field and drop ethnicity",
    },
    'hosp_yn': {
        "Data Quality Issues":  "Missing, Unknown, and OTH values",
        "Data Quality Actions": "Bin unknown into groups",
    },
    'icu_yn': {
        "Data Quality Issues":  "72% of data missing",
        "Data Quality Actions": "Determine if missing because 'No'. Column is relevant, so await answer before dropping",
    },
    'death_yn': {
        "Data Quality Issues":  "Not applicable",
        "Data Quality Actions": "No action",
    },
    'medcond_yn': {
        "Data Quality Issues":  "80% missing",
        "Data Quality Actions": "Group missing values consistently. Column is relevant, so keep until an answer on the cause of the missing values",
    },
}


3. Extension and Analysis Commentary

Extension Commentary

I elected to pair and plot all combinations of features within the dataset.

To extend the set, I created day, month, year, and workday features for the cdc_case_earliest_dt date. This was primarily to help determine if cases followed any trend in timing within the week, month, or year, which could be insightful. As a confounding factor, if certain areas operate on a rotating staff basis, then trends in deaths could point to further areas to investigate.

Adding on to my earlier analysis and inspection, I changed the onset date into a column giving the number of days after diagnosis that symptoms appeared. My initial hypothesis is that individuals who got tested but did not become symptomatic until later would have had a better expected outcome, due to earlier intervention and treatment management, and that this could have predictive power in determining whether a patient was at risk of dying.

Finally, I added flags for whether demographic or medical data was missing for a particular record. Although I personally wished to avoid removing rows until the data is fed into an ML model, it will be necessary to experiment with different features being present or absent, given the high quantity of missing values within the dataset. These flags provide a convenient way to filter the dataset and focus on rows where complete data is present, if needed.
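The extension features described above (calendar features, days-to-onset, and missingness flags) can be sketched on toy rows; the derived column names here are my own and purely illustrative:

```python
import pandas as pd

# Toy rows; column names taken from the data dictionary, values hypothetical.
df = pd.DataFrame({
    "cdc_case_earliest_dt": pd.to_datetime(["2020-06-01", "2020-06-03"]),
    "onset_dt":             pd.to_datetime(["2020-06-05", None]),
    "medcond_yn":           ["Yes", None],
})

# Calendar features derived from the case date.
df["case_month"]   = df["cdc_case_earliest_dt"].dt.month
df["case_weekday"] = df["cdc_case_earliest_dt"].dt.dayofweek
df["case_workday"] = df["case_weekday"] < 5  # Monday-Friday

# Days from the earliest case date to symptom onset, plus missingness flags.
df["days_to_onset"]   = (df["onset_dt"] - df["cdc_case_earliest_dt"]).dt.days
df["onset_missing"]   = df["onset_dt"].isna()
df["medcond_missing"] = df["medcond_yn"].isna()

print(df[["days_to_onset", "onset_missing", "medcond_missing"]])
```

The missingness flags preserve the information that a value was absent even after the original column is imputed or dropped.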

Analysis

For the purpose of analysing pairs of features, beyond some of the analysis already conducted, I am electing to focus on plotting the target feature death_yn versus the other categorical columns. Other features of interest may be briefly discussed; however, the primary focus will be on the death_yn feature against the others. Unfortunately, as the data is primarily categorical, the analysis focuses mainly on feature distributions.

Key Points:

One area of note is that a 100% stacked bar chart diminishes the importance of how prevalent features actually are, and fails to account for how single instances have a larger impact on a group of smaller size. Due to this, the 100% stacked bar charts can give a highly misleading view of the data (although they can be useful for gaining a perspective on factors relevant to our model) and should be considered alongside the stacked (but not 100% stacked) bar charts produced throughout the report.
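The difference between the two chart styles can be seen numerically. In this toy sketch (proportions hypothetical), the 100% stacked view reports a 50% death share for a group that contains only 10 of the 100 rows:

```python
import pandas as pd

# Toy data with one small and one large group (hypothetical proportions).
df = pd.DataFrame({
    "age_group": ["80+ Years"] * 10 + ["20 - 29 Years"] * 90,
    "death_yn":  ["Yes"] * 5 + ["No"] * 5 + ["No"] * 90,
})

counts = pd.crosstab(df["age_group"], df["death_yn"])   # raw counts: stacked bars
props  = counts.div(counts.sum(axis=1), axis=0)          # row shares: 100% stacked bars

# counts.plot(kind="bar", stacked=True); props.plot(kind="bar", stacked=True)
# The 100% view shows a 0.5 death share for '80+ Years' without revealing
# that the group holds only 10 of the 100 rows.
print(counts)
print(props)
```

Plotting both views, as done throughout the report, keeps the group sizes visible alongside the within-group proportions.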


Homework 2:


4. Exploratory Analysis

Following the splitting of our dataset into train and test features, we arrive at the below:

  1. current_status vs death

Probable cases, interestingly, are slightly more likely than laboratory-confirmed cases to have resulted in death. My suspicion is that this is due to retrospective classification of this data. Overall probable-case volume is small (<5% of the data), so this is likely not a significant indicator for future data.

  2. sex vs death

Males are more likely to be flagged as deaths in the training data set, so there is potentially a higher risk for males. A key factor as to why this might be is that the life expectancy for males in the US is lower than for females. Particularly in the older categories, males might be at a more pronounced risk, as an overall lower life expectancy results in a greater susceptibility to COVID. Ultimately, the differences by sex are relatively minor, so this is unlikely to have a significant sway within our model, given that some features (e.g. age, ICU, medical condition, hospitalisation) have a more significant correlation.

  3. age_group vs death

Age group is a highly significant factor. From 40+, there is a greatly increased likelihood of death. This aligns closely with what is currently known about COVID, where people in older categories are at a greatly increased risk of significant COVID complications. Particularly in the 80+ category there is a very significant increase in mortality rate, and as such this is likely to be a very important predictor of death in our dataset.

  4. hospitalised vs death

We see that hospitalisation is correlated with an increased likelihood of death within the set, which is not too surprising, given that those who are hospitalised are more likely to have a more serious presentation of COVID than those who do not require hospitalisation.

  5. ICU vs death

We see ICU admission has a very significant impact on whether somebody is likely to be flagged as having died. As with hospitalisation, this is likely because those who require an ICU admission tend to have an extreme presentation of COVID, so being admitted to ICU is likely to be a very strong indicator of prognosis. The proportion of missing values aligns with those who were not admitted to the ICU. As described in Homework 1, I strongly suspect that a missing ICU indicator in fact indicates that the patient was not admitted to the ICU. The proportion of admissions supports this, and it is further supported by the ICU flag being most heavily missing for patients in younger age categories: likely, unless the patient was explicitly admitted, the field was left unchecked, resulting in a missing value, whereas admitted patients are much more likely to have a value flagged.

Regardless, ICU is a clearly promising indicator.

  6. Medical Conditions vs Death

People with a medical condition noted are at an elevated risk from COVID. While the correlation does not appear to be as strong as ICU admission or being in an older age category, it is likely to be a relevant factor.

  7. Race vs Death

We see that some races appear to be disproportionately impacted by COVID, but by and large the proportions are similar, except for some minority groups which are not heavily represented in the data set. As there does not appear to be a significant correlation, we do not use this feature.

Based on the above, we elect to include the following predictive features as having the most relevance to our model:

  1. Medical Condition
  2. ICU
  3. Age Group
  4. Hospitalisation
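As a numeric companion to the plots above, the death rate per category level can be checked directly; a minimal sketch on toy data (the levels and proportions are hypothetical):

```python
import pandas as pd

# Toy sample; the real CDC extract has levels such as Yes/No/Unknown/Missing.
df = pd.DataFrame({
    "icu_yn":   ["Yes", "Yes", "No", "No", "Missing", "Missing"],
    "death_yn": ["Yes", "No",  "No", "No", "No",      "No"],
})

# Death rate per ICU level: a quick numeric check behind the bar charts.
rate = (df["death_yn"] == "Yes").groupby(df["icu_yn"]).mean()
print(rate)
```

The same groupby can be repeated for age_group, hosp_yn, and medcond_yn to rank the candidate features by the spread of their per-level death rates.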

5. Model Creation and Analysis

0. Everybody Lives Model

Although it is not explicitly required, we first need to determine what a 'default' model would look like. As the supermajority of patients do not die from COVID, we want to see what a model looks like which simply predicts that everybody will live.
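A sketch of this 'everybody lives' baseline, using scikit-learn's DummyClassifier on toy labels that mimic the class imbalance (the exact proportions are hypothetical):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels mimicking the imbalance: most patients survive.
y = ["No"] * 95 + ["Yes"] * 5
X = [[0]] * len(y)  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))                              # high accuracy from imbalance alone
print(f1_score(y, pred, pos_label="Yes", zero_division=0))  # but no deaths detected
```

The baseline makes the point numerically: accuracy alone is a misleading yardstick here, so the later models are judged against this majority-class accuracy and on per-class precision, recall, and F1.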

2. Predictive Modeling: Linear Regression.

In the cell below, we create a linear regression model and evaluate it using a number of metrics.

- (2.1) On the training set, train a linear regression model to predict the target feature, using only the descriptive features selected in exercise (1) above.

In the function that generates the model, we create a linear regression model using only the features listed above. We train the model using the training set, completing this requirement.
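A minimal sketch of this training step on toy data; the one-hot encoding via pd.get_dummies and the two-feature frame are illustrative stand-ins for the real selected features (age_group, hosp_yn, icu_yn, medcond_yn):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training frame; the real selected features are age_group, hosp_yn,
# icu_yn, and medcond_yn, one-hot encoded the same way.
df = pd.DataFrame({
    "age_group": ["80+ Years", "20 - 29 Years"] * 20,
    "icu_yn":    ["Yes", "No"] * 20,
    "death_yn":  ["Yes", "No"] * 20,
})

# One-hot encode, dropping one level per feature to avoid a redundant column.
X = pd.get_dummies(df[["age_group", "icu_yn"]], drop_first=True)
y = (df["death_yn"] == "Yes").astype(int)

lin = LinearRegression().fit(X, y)
print(dict(zip(X.columns, lin.coef_)))
print(lin.intercept_)
```

Encoding the target as 0/1 is what lets a regression model be repurposed for this classification problem; the dropped dummy level becomes the baseline captured by the intercept.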

- (2.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model by analysing each coefficient and how it relates each input feature to the target feature).    

In the function that generates the model, we print out the features and the learned equation (coefficients rounded here to six decimal places):

death_yn =
      0.002270 * 'age_group_10 - 19 Years'
    + 0.000217 * 'age_group_20 - 29 Years'
    - 0.000623 * 'age_group_30 - 39 Years'
    - 0.000868 * 'age_group_40 - 49 Years'
    + 0.002515 * 'age_group_50 - 59 Years'
    + 0.033284 * 'age_group_60 - 69 Years'
    + 0.096178 * 'age_group_70 - 79 Years'
    + 0.264586 * 'age_group_80+ Years'
    + 0.104388 * 'age_group_Unknown'
    + 0.000000 * 'hosp_yn_OTH'          (-3.33e-16, effectively zero)
    + 0.012190 * 'hosp_yn_Unknown'
    + 0.172866 * 'hosp_yn_Yes'
    + 0.017782 * 'icu_yn_Unknown'
    + 0.263952 * 'icu_yn_Yes'
    - 0.005134 * 'medcond_yn_Unknown'
    + 0.021700 * 'medcond_yn_Yes'
    - 0.021912                           (intercept)

From this, we observe that the features with the highest positive coefficients are the older age categories (70+), hospitalisation status, and ICU status; people with these elements flagged are weighted heavily towards a death prediction. As the threshold is 0.5 and the intercept is -0.02, and as the age, hospitalisation, ICU, and medical-condition features are mutually exclusive within each group (an example can only fall into one age value, one ICU value, one hosp value, and one medcond value), some combination of hosp_yn_Yes, icu_yn_Yes, and an age above 70 would be required for the model to predict death; otherwise the coefficients do not add up to meet the 0.5 threshold.

As this is a linear regression model, each coefficient relates to the unit change in the predicted target value given the presence of that feature. The intercept shifts the line up or down and corresponds to the baseline case, in which all dummy features are zero.

With regard to discussing each of these features individually, it seems unnecessary and overly verbose to run through the weighting of every feature; the key aspects have been highlighted, and the weights and their importance are clear.

- (2.3) Print the predicted target feature value for the first 10 training examples. Threshold the predicted target feature value given by the linear regression model at 0.5, to get the predicted class for each example. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.


In the generation function, I have 'predicted' the training data that was used to fit the model.

In the generated function, I have thresholded the values and printed the first ten results.

In the generated function, I have computed the evaluation metrics.

This is not a 'good' evaluation technique, as the model has been trained on this same data; these results carry little actual weight or insight into how the model will perform on new data. Similarly, linear regression is not meant to be used in this manner: it is not a classification model and is not really designed for a categorical target like this.
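The thresholding step itself can be sketched as follows, with hypothetical raw regression outputs:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical raw linear-regression outputs and true labels.
y_true  = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.03, 0.10, 0.55, 0.20, 0.80])

# Threshold the continuous output at 0.5 to obtain a class prediction.
y_pred = (y_score >= 0.5).astype(int)

print(y_pred)                          # [0 0 1 0 1]
print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```

The same accuracy_score, confusion_matrix, and classification_report calls then apply unchanged to the thresholded training and test predictions.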

At this stage, we see that the model has a high accuracy for predicting non-deaths; however, it is also over-enthusiastic in falsely classifying examples as non-deaths. While the results are slightly better than just flagging everybody as not death, the model at this stage ultimately appears to be quite poor. Particularly for COVID, we would prefer a model which is overly aggressive in classifying people as potential deaths, to ensure those patients get priority treatment, rather than one which falsely classifies patients at significant risk as likely not to die.

These results are over the training set, so they are not a reliable guide on their own; the features will be examined further on the actual test data. However, as an initial baseline, we know that we are likely to end up with a model which is accurate, but accurate only because the supermajority of examples are not deaths: it is poor at predicting death yet still achieves a high accuracy.

The results of this are listed below:


As required, the first ten results predicted for the training data:

      Actual  Predicted  PredictionClass  Diff
8396       0   0.003143                0     0
987        0   0.005441                0     0
7274       0  -0.010131                0     0
1000       0   0.036210                0     0
4848       0   0.036210                0     0
9819       0   0.002058                0     0
7109       0  -0.009263                0     0
3123       0  -0.009887                0     0
5279       1   0.196887                0     1
7752       0   0.002303                0     0

----REPORT----
MAE:  0.03366751458925632
MSE:  0.03366751458925632
RMSE: 0.18348709651977252
R2:   0.003572408363848978

----DETAIL----

Accuracy: 0.9663324854107437

Confusion matrix:
[[6445    4]
 [ 221   13]]

Classification report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      6449
           1       0.76      0.06      0.10       234

    accuracy                           0.97      6683
   macro avg       0.87      0.53      0.54      6683
weighted avg       0.96      0.97      0.95      6683


- (2.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained on the training (70%) dataset. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated random train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.


In the generation function, I have evaluated the model on the test set and kept these as the main results. Compared to the evaluation on the training data, the key shift is notably worse predictive power for deaths. While the model retains a high accuracy, this largely stems from it flagging far too many people as 'not death'. This is likely driven by the small number of positive death instances in our dataset, which is heavily skewed towards non-death; the splitting of the full CDC dataset into samples of ten thousand is probably also hampering the model's ability to learn to classify deaths. This suggests that on a new dataset, the model would likely perform poorly at correctly classifying patients who are at significant risk from COVID.

In the generating function I have provided 5-fold cross-validation over the entire dataset. We observe that the cross-validated RMSE averages approx. 0.157, vs. approx. 0.18 when predicting the training set.
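The 5-fold cross-validation can be sketched as below. The data here is a synthetic, imbalanced stand-in built with `make_classification` (the real one-hot-encoded CDC frame is produced elsewhere in the notebook), and note that scikit-learn reports MSE negated, so it must be negated back and square-rooted to recover a per-fold RMSE:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, heavily imbalanced stand-in for the encoded CDC frame
# (X, y and the 95/5 class balance are assumptions for illustration).
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.95], random_state=42)

# scikit-learn maximises scores, so MSE is reported negated; negate it
# back and take the square root to recover a per-fold RMSE.
neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)
print("Mean 5-fold RMSE:", rmse_per_fold.mean())
```

Averaging the per-fold RMSE gives the single number quoted above for comparison against the train/test split.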

Based on our results, the macro-average F1 is approx. 0.59 on the test set, which suggests our prediction is only marginally better than the 0.49 achieved by flagging everybody as 'not death' (particularly since that baseline's RMSE is similar, at around 0.18). The macro F1 is slightly higher than on the training set, but we observe that precision on both classes has dropped.

Ultimately, based on the creation and analysis of our Linear Regression model, we can conclude that it is a poor model. While its accuracy is high, this is driven by the model heavily biasing towards 'not death' (incorrectly flagging people who did die as survivors) because 'not death' occupies the majority of our dataset. Linear regression is not designed for classification in this manner, so it should be expected that these results are not particularly strong. We should not use this model.

3. Predictive Modeling: Logistic Regression.

- (3.1) On the training set, train a logistic regression model to predict the target feature, using the descriptive features selected in exercise (1) above.   

As with linear regression, this is handled in the model-generating function.
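A minimal sketch of the training step, assuming a one-hot-encoded feature matrix; `X_train` and `y_train` here are synthetic stand-ins, not the actual CDC split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced stand-in for the one-hot-encoded training split
# (X_train, y_train are assumptions, not the real CDC data).
X_train, y_train = make_classification(n_samples=800, n_features=6,
                                       weights=[0.96], random_state=0)

# Fit the classifier; max_iter raised so the lbfgs solver converges cleanly.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Intercept:", logreg.intercept_[0])
print("Coefficients:", logreg.coef_[0])
```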

- (3.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model).    

death_yn = logistic(
('age_group_10 - 19 Years' * -1.3954739647453331)
+('age_group_20 - 29 Years' * -1.7477619209329482)
+('age_group_30 - 39 Years' * -1.437896043371963)
+('age_group_40 - 49 Years' * -0.5467412391146604)
+('age_group_50 - 59 Years' * 0.08372144064151235)
+('age_group_60 - 69 Years' * 1.385612326921969)
+('age_group_70 - 79 Years' * 2.1476304546024774)
+('age_group_80+ Years' * 3.2574857584006645)
+('age_group_Unknown' * 0.8325318957143447)
+('hosp_yn_OTH' * 0.0)
+('hosp_yn_Unknown' * 0.8415435273812718)
+('hosp_yn_Yes' * 2.3672678007865713)
+('icu_yn_Unknown' * 0.4915252721992774)
+('icu_yn_Yes' * 1.9132227077415447)
+('medcond_yn_Unknown' * 0.3089176968434786)
+('medcond_yn_Yes' * 0.9468181138631563)
+ (-6.12674106) )

where logistic(x) = 1/(1 + e^{-x}) is the standard logistic (sigmoid) function.

I.e. for F={(age features, age weighting),(hosp features, hosp weighting),(icu features, icu weighting), (med_cond features, med_cond weighting)} we have $ \mathrm{P}(death\_yn=1|F)=\dfrac{1}{1 + e^{-\left(-6.12674106 + \sum\limits_{f=(f_1,f_2) \in F} f_2 \cdot f_1\right)}}$, i.e. the sigmoid is applied once to the intercept plus the sum of the weighted features.

Each coefficient therefore represents the change in the log-odds of death when the corresponding feature is present, so features with larger coefficients shift the predicted probability more, and because of the exponential relationship the impact grows quickly as the coefficient increases. To this end, we see that again the older age groups are weighted heavily by the model, and again it is sensitive to ICU admission and hospitalisation. The intercept shifts the curve and dictates the base case (the log-odds when no feature is active).
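Because the coefficients are log-odds changes, exponentiating them gives a more intuitive odds ratio. A small worked example using two of the coefficients printed above:

```python
import math

# Coefficients reported by the model above (log-odds changes).
coef_age_80_plus = 3.2574857584006645
coef_hosp_yes = 2.3672678007865713

# exp(coefficient) converts a log-odds change into an odds ratio:
# being in the 80+ group multiplies the odds of death by roughly 26x
# relative to the baseline group, holding the other features fixed.
print("Odds ratio, 80+ years:", math.exp(coef_age_80_plus))
print("Odds ratio, hospitalised:", math.exp(coef_hosp_yes))
```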

- (3.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.

Again this is taken care of in the function. The first ten rows were printed and classification measures were printed.

Looking at the evaluation on the training set, the logistic regression model is doing a better job than the linear regression model at predicting the death_yn feature. While more patients are flagged as potentially dying, the true positives are identified more accurately over the training set, and the macro average is significantly improved over the previous linear regression model. The model is still under-flagging deaths, which is a problem in a healthcare context where at-risk patients must be identified so they receive care, but it is an improvement on what had been achieved using a very simple regression model.


As required the First Ten Results predicting for the training data:

      Actual  Predicted  PredictionClass  Diff
8396       0          0                0     0
987        0          0                0     0
7274       0          0                0     0
1000       0          0                0     0
4848       0          0                0     0
9819       0          0                0     0
7109       0          0                0     0
3123       0          0                0     0
5279       1          0                0     1
7752       0          0                0     0

----REPORT----
MAE:  0.03321861439473291
MSE:  0.03321861439473291
RMSE: 0.1822597443066705
R2:   0.01685810958566436
----DETAIL----

Accuracy: 0.966781385605267

Confusion matrix:
[[6398   51]
 [ 171   63]]

Classification report: precision recall f1-score support

       0       0.97      0.99      0.98      6449
       1       0.55      0.27      0.36       234

accuracy                           0.97      6683

   macro avg       0.76      0.63      0.67      6683
weighted avg       0.96      0.97      0.96      6683


- (3.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.


In the generating function I have printed these results. We can observe that this model performs more strongly than our simple linear regression model, achieving a macro F1-score of 0.70 on the test data. This is slightly stronger than what is achieved over the training set and is a good sign that the model generalises rather than overfitting to the training data. In fact, almost all metrics on the test data are higher than those observed on the training data; this would warrant further investigation to confirm that the comparatively strong results really are a consequence of good generalisation. We see the RMSE is 0.175, while over 5-fold cross-validation it is 0.18, so the model is comparatively consistent. The model correctly identified 31 of the patients who had died of COVID, a much stronger result than what was seen in the linear regression example. Particularly in a healthcare setting, it is much more important that the model correctly flags patients who will die or are at risk than that it correctly classifies healthy patients, so long as the false positive rate is not so high as to overburden the healthcare system.

4. Predictive Modeling: Random Forest

- (4.1) On the training set, train a random forest model to predict the target feature, using the descriptive features selected in exercise (1) above.

This is done.

- (4.2) Can you interpret the random forest model? Discuss any knowledge you can gain in regard of the working of this model.

As with the other models, the generating function for the model has plotted the feature importances. We can see that the key features which the model uses to weigh the results are whether you were admitted to the ICU, whether you are in the over-80 age group, whether you were hospitalised, whether you are in the 70-79 age group, and whether you have a prior medical condition (being over 80 or hospitalised is weighted with almost three times the weight of the next highest feature).
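The importance plot comes down to the model's `feature_importances_` attribute, which sums to 1. A sketch on stand-in data (the feature names below mirror the real columns, but the data itself is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Feature names mirror the real one-hot columns; the data is synthetic,
# so the importances printed here are illustrative only.
feature_names = ["icu_yn_Yes", "age_group_80+ Years", "hosp_yn_Yes",
                 "age_group_70 - 79 Years", "medcond_yn_Yes"]
X, y = make_classification(n_samples=500, n_features=5, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# feature_importances_ is normalised to sum to 1; sorting it shows
# which inputs drive the model's splits.
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:25s} {imp:.3f}")
```

The notebook's bar chart is essentially `plt.barh(feature_names, rf.feature_importances_)` over these values.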

- (4.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.

This is done.

Based on the training data results, we see that the macro-average F1 score is 0.72 with a high accuracy of 97%. This is the strongest-performing model examined so far over my sample data, though its performance is comparable to that of the Logistic Regression model over the training set. As with the other models, the biggest challenge is accurately classifying deaths, but this model has done better at it than all previous models when looking at the training set.

- (4.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and to the out-of-sample error and discuss your findings.


In the generating function I have printed these results. We can observe that this model performs more strongly than our simple linear regression model, our 'everybody lives' model, and our logistic regression model: it achieves a macro F1-score of 0.71 on the test data, an out-of-the-box (untuned) accuracy of 97%, and a 5-fold cross-validated accuracy of 98.3%. The difference between the training and test results is negligible, suggesting the model is well-generalised and not overfit. The RMSE is lower than the Logistic model's and the accuracy on deaths is slightly higher. Based on these results, I would recommend the RF model for use in a production setting.

6. Model Extension and Refinement

1. Compare Models

- (5.1) Which model of the ones trained above performs better at predicting the target feature? 

Is it more accurate than a simple model that always predicts the majority class (i.e., if 'no' is the majority class in your dataset, the simple model always predicts 'no' for the target feature)? Justify your answers.


In the cells below I have created some comparisons of each of the models which were built, having also compared them against one another while describing their performance individually in the previous sections. Each of the models performed better than the simple 'everybody lives' model. While the 'everybody lives' model may in some instances have a higher accuracy, that accuracy is driven by the majority class being that patients live; by the nature of the model, it never predicts death. In the context of this assignment, that is an incredibly poor property, as it means patients who are at risk of dying from COVID would not be appropriately flagged and hence could receive a poor case outcome because doctors are not aware of the underpinning workings of the model.

The model which has the best balance of true positives, false positives, and all other aspects is the Random Forest model, which ekes out slightly ahead of the Logistic Regression model. Overall, the XGBoost model, a non-mandatory aspect of the assignment, is the best-performing model, though its performance is comparable to the Random Forest model (this is not too surprising, as the default operation of XGBoost is to build gradient-boosted trees). The one downside of the XGBoost model is that its training time is more significant than the other models' due to the GridSearch used to try to pinpoint the optimal hyperparameters.

We see in the graphs below that over the test dataset, the performance of Logistic Regression, Random Forest, and XGBoost are all quite comparable, with relatively minor performance differences between the three. I would recommend comparing against AutoML as an additional model, as it is highly performant, incorporates ensembling and hyperparameter optimisation, and is well regarded as an easily implementable out-of-the-box ML kit from Google's team, but this is out of scope for the assignment.

Conclusion: Random Forest of the required models, XGBoost overall. Minor differences between Logistic, RF, and XGB. Avoid: Linear Regression, Simple.

2. Conclusion and Feature Addition

This has been partly covered in the previous section and is outlined in the problem scope section in the intro.

We are trying to predict whether a patient is likely to have a good (living) or bad (dying) prognosis based on a combination of demographic details and their patient history by creating ML models.

There are two key challenges with this:

  1. We have very poor data in relation to both scope (only 10k data points, of which only a fraction are deaths, which is wildly insufficient for a decent model; SKLearn recommends no fewer than 50k datapoints) and depth (no access to patient-level data [i.e. medical history, location] or a patient identifier).
  2. In a healthcare setting, it is very important that we do not miss true positives, as the risk to a patient of being incorrectly classified as 'unlikely to die' is very serious. As such, we need a model which will minimise the number of missed deaths while also balancing the false positive rate.

To do this, we developed five models (Simple, Linear Regression, Logistic Regression, Random Forest, and XGBoost) and compared their performance over both the training set and the test set, analysing the results for each model and paying particular attention to the number of deaths correctly predicted and the overall precision.

Based on including ICU, age, medical condition history, and hospitalisation status, we see that the Random Forest model is the best performing of the base models, and XGBoost is the overall strongest but was an additional model not mandatory for this assignment. The performance of Logistic Regression, Random Forest, and XGBoost was overall very comparable, with minor performance differences between the three, while Linear Regression performed poorly and the Simple model, although it had a high accuracy, was totally inappropriate for the context of the problem.

Yes, we could create a gender-specific model. We see that the distributions of death by gender differ, so we could create one model for male patients and one for female or unknown patients, then call whichever model is relevant to the patient's gender. Alternatively, we could add additional features or include an age split. We could also use XGBoost, as I have already done.
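One way the gender-specific variant could be dispatched is sketched below, on toy data with assumed column names (`sex`, `icu_yn_Yes`); train one model per gender group and select the model matching the patient:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy frame with assumed column names mirroring the CDC fields;
# in the Male rows ICU admission perfectly predicts death here.
df = pd.DataFrame({"sex": ["Male", "Female", "Male", "Female"] * 25,
                   "icu_yn_Yes": [1, 0, 0, 0] * 25,
                   "death_yn": [1, 0, 0, 0] * 25})

# Train one model per gender group, then dispatch on the patient's sex.
models = {}
for group, frame in df.groupby("sex"):
    models[group] = RandomForestClassifier(random_state=0).fit(
        frame[["icu_yn_Yes"]], frame["death_yn"])

patient = pd.DataFrame({"icu_yn_Yes": [1]})
print(models["Male"].predict(patient))
```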

(this is a summary and recap: please see full conclusions below)

As part of testing extensions of our model, we have tried:

  1. Developing a RF Model specific to Gender.
  2. The extension of our model to include all features.
  3. The creation of an XGBoost model.

In all cases, the first two potential improvements resulted in an overall inferior model, except for the extension of the model to include all features under XGBoost, where we got similar performance.

Based on these results, and particularly given that the XGBoost model only attains performance on par with the original model, I suspect the key requirement for improving results further will be to gather additional data or, as I have demonstrated with XGBoost, to develop a new model.

Ideally, we would also gather more granular patient data to develop more 'hard-hitting' features, such as history of pulmonary illness or cardiovascular risk indicators, which are significant for COVID patients. Based on the performance of the XGBoost model, I believe there is only negligible room for improvement using the current features outside of more sophisticated methods beyond the scope of this module, and think the best chance for improvement will come from additional data being used to train the model.

- (4.3) Take your best model trained and selected based on past data (ie your cleaned Homework1 dataset), and evaluate it on the new test dataset provided with this homework (in file '24032021-covid19-cdc-deathyn-recent-10k.csv'). Discuss your findings. 

First, I read in the new file and determine that it has data quality errors. As a result, I copy my Assignment 1 cleaning code into a function. Because I assume duplicate rows still require predictions, I do not drop rows which are duplicates, unlike in my Assignment 1 submission; all other components are the same. I then train a new Random Forest model over the whole historic data file and use it to predict each row of the newly cleansed file.

(commentary and findings)

We see both the XGBoost and RF models perform worse on the new dataset compared to the earlier test results. Random Forest ended up slightly better in macro-average F1 score, although the two were very close.

Based on the drop in sensitivity for predicting deaths on the new dataset, it is likely that time is an important factor in the outcome of COVID. This makes sense, as the original data includes cases from the beginning of the pandemic (including probable cases), when mortality was high due to a poor understanding of the disease and uncertainty in how to treat it. As time has passed, there has been a greater development in understanding of which factors are significant.

Due to this time effect, it is likely important to refresh the models and re-examine the features used to build them, likely accounting for the stage of the pandemic in which the diagnosis was made. This would likely lead to a higher degree of accuracy over the new dataset.

Ultimately, while neither model is optimal by any means, both are useful as indicators and better than guessing. In productionising these models, it would be important to note very clearly that the result is only an indicator, as we still see that they fail to classify every death case.

Personally, because of the ethical considerations, I would advise that the model is not deployed unless the death classification can be significantly improved either by collecting vastly more data or by getting a better dataset to work with (particularly involving patient history or regional data).

01. Import Modules

We are going to use the 'ggplot' stylesheet (matplotlib's GGPlot2-inspired style).
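A minimal sketch of the style setup (the `Agg` backend line is an assumption added for non-interactive runs; a notebook would not need it):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for script runs
import matplotlib.pyplot as plt

# Apply matplotlib's built-in ggplot2-inspired stylesheet.
plt.style.use("ggplot")
print(plt.rcParams["axes.grid"])  # ggplot style turns the grid on
```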

02. Constants Per Previous Submission

03. Functions from Submission 1

I'm relisting the functions from Submission 1, as there is no point rewriting them when they may be useful in this submission.

04. Function to Group Over and Agg Columns

05. A function to create an XGBoost Regressor Model

0.0. Read in the Cleansed and Extended Data From my Submission 1.

We saved the cleansed and extended file in Assignment 1 as a pickle, so all of the column type assignments are already complete. Columns retain the same types they had at the end of Assignment 1.

We print some info on what sort of columns are present, how much data we have, and the column types to validate.

Recall that "onset_present", "cdc_case_earliest_day", "cdc_case_earliest_weekday", "cdc_case_earliest_month", "cdc_case_earliest_year", "demographic_missing", and "medical_missing" were features which I added at the end of Assignment 1. For the initial model creation, we will stick with the original features and drop these. We will return to some of them later in the assignment.

0.1 Staging Dataframe

This is going to contain the original features which were present.

4. Exploratory Analysis

In this section we begin the Assignment 2 Material explicitly.

Most of this is a redundant replica of the analysis in Assignment 1 but directed at the training data specifically. We leverage our defined analytics functions above to allow for a concise analysis of the data which is quickly adapted to new datasets.

1.0 Test Train Split

We will split the data 70-30 and analyse our training data.
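The split itself is a one-liner with scikit-learn. The toy frame below stands in for the cleaned CDC dataframe, and `random_state=42` is an illustrative choice, not necessarily the seed used in the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned CDC dataframe (100 rows, 5% deaths).
df = pd.DataFrame({"age_group": ["0 - 9 Years"] * 100,
                   "death_yn": [0] * 95 + [1] * 5})

# 70/30 split with a fixed random_state so the shuffle is reproducible.
train_df, test_df = train_test_split(df, test_size=0.3,
                                     random_state=42, shuffle=True)
print(len(train_df), len(test_df))  # 70 30
```

Fixing `random_state` is what makes the split reproducible for verification, as required in task 1.1.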

1.1 Let's do a sense check on the Testing Data and Training Data

We see the train and test data are split correctly (i.e. in the right proportions) and the total length matches the original dataset. We are now good to start analysing our training data.

This corresponds with task 1.1

2.0 Explore Data vs Target

We plot all pairs of categorical features against the target column.
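Each pairwise plot boils down to a normalised crosstab of a feature against death_yn; the frame below is a toy stand-in, and the stacked-bar rendering would be `rates.plot(kind="bar", stacked=True)`:

```python
import pandas as pd

# Toy stand-in: alternating age groups, with deaths only in the 80+ rows.
df = pd.DataFrame({"age_group": ["80+ Years", "20 - 29 Years"] * 50,
                   "death_yn": [1, 0, 0, 0] * 25})

# Proportion of deaths within each category (rows normalised to sum to 1).
rates = pd.crosstab(df["age_group"], df["death_yn"], normalize="index")
print(rates)
```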

2.1 Analysis and Commentary on Exploration

  1. current_status vs death

Probable cases, interestingly, are slightly more likely than laboratory-confirmed cases to have resulted in death. My suspicion is that this is due to retrospective classification of the data. Overall, probable case volume is small (<5% of data), so this is likely not a significant indicator for future data.

  2. sex vs death

Males are more likely to be flagged as deaths in the training dataset, so there is potentially a higher risk for males. A key factor may be that life expectancy for males in the US is lower than for females; particularly in the older categories, males might be at more pronounced risk, with lower overall life expectancy translating into greater susceptibility to COVID. Ultimately, the differences by sex are relatively minor, so this is unlikely to have significant sway within our model given that some features (e.g. age, icu, medcond, hosp) have a more significant correlation.

  3. age_group vs death

Age group is a highly significant factor. From 40+, there is a greatly increased likelihood of death. This aligns with what is currently known about COVID, where people in older categories are at greatly increased risk of significant COVID complications. Particularly for people in the 80+ category there is a very significant increase in mortality rate, so this is likely to be a very important predictor of death in our dataset.

  4. hospitalised vs death

We see that hospitalisation is correlated with an increased likelihood of death within the set which is not too surprising given that those who are hospitalised are more likely to have a more serious presentation of COVID than those who do not require hospitalisation.

  5. ICU vs death

We see ICU admission has a very significant impact on whether somebody is likely to be flagged as having died. Similarly to hospitalisation, this is likely because those who require an ICU admission tend to have an extreme presentation of COVID, so being admitted to ICU is likely to be a very strong indicator of prognosis. The proportion of missing values aligns with those who were not admitted to the ICU. As described in HW1, I strongly suspect that a missing ICU indicator in fact means the patient was not admitted to the ICU. The proportion of admissions supports this, and, as analysed in Assignment 1, so does the ICU flag being most heavily missing for patients in younger age categories: for these patients, unless they were explicitly admitted, the field was likely left unchecked, resulting in a missing value, whereas admitted patients are much more likely to have a value flagged.

Regardless, ICU is a clearly promising indicator.

  6. Medical Conditions vs Death

People with a medical condition noted are at an elevated risk from COVID. While the correlation does not appear to be as strong as ICU admission or being in an older age category, it is likely to be a relevant factor.

  7. Race vs Death

We see that some races appear to be disproportionately impacted by COVID, but by and large the proportions are similar, except for some minority groups which are not heavily represented in the dataset. As there does not appear to be a significant correlation, we do not use this feature.

2.2 Conclusion

Based on the above, we elect to include the following predictive features as having the most relevance to our model:

  1. Medical Condition
  2. ICU
  3. Age Group
  4. Hospitalisation

As we do not have continuous features in our data set, this concludes 2.2

2.3 Re-run Test/Train split after dummying our data.

We encode the target feature to an int and then one hot encode our predictive features.
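A sketch of the encoding step on a toy frame; `drop_first=True` is an assumption that matches the column set used by the models (e.g. there is an `icu_yn_Unknown` and `icu_yn_Yes` but no `icu_yn_No`):

```python
import pandas as pd

# Toy stand-in frame for the cleaned training data.
df = pd.DataFrame({
    "death_yn": ["No", "Yes", "No"],
    "icu_yn": ["No", "Yes", "Unknown"],
})

# Encode the target to an int, then one-hot encode the predictive
# features, dropping the first level to avoid collinear dummy columns.
df["death_yn"] = (df["death_yn"] == "Yes").astype(int)
encoded = pd.get_dummies(df, columns=["icu_yn"], drop_first=True)
print(encoded.columns.tolist())
```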

2.4 Correlation Heatmap

Now that we've dummified the data, and converted relevant columns into numeric values, we will create a heatmap and correlation matrix.

We can see that death_yn is moderately positively correlated with icu, 80+, and med cond in particular as we ascertained from our previous analysis. We suspect these factors will be weighed heavily in our model.
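The heatmap is just a colour-coded rendering of the Pearson correlation matrix (e.g. via `seaborn.heatmap(corr, annot=True)`); a sketch on a toy dummified frame:

```python
import pandas as pd

# Toy dummified frame standing in for the real one; here icu_yn_Yes
# coincides exactly with death_yn so its correlation is 1.
df = pd.DataFrame({"death_yn": [0, 0, 1, 1, 0],
                   "icu_yn_Yes": [0, 0, 1, 1, 0],
                   "age_group_80+ Years": [0, 1, 1, 0, 0]})

# Pearson correlation matrix; the heatmap colour-codes these values.
corr = df.corr()
print(corr["death_yn"].round(2))
```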

5. Create Models and Analyse

Now that we have our selected features and our training and test data, we need to create a Linear Regression model, a Logistic Regression model, and a RF model.

I have created functions above which are generalised enough to create each of these models and flag all of the metrics which are required for evaluation and the tasks listed in the exercise. As such, the primary code remaining is to call these models and analyse the results.

0. Everybody Lives Model

Although it is not required, it is useful to determine what a 'default' model would look like. As the supermajority of patients do not die from COVID, we want to see how a model that predicts everybody will live performs.
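Such a majority-class baseline can be sketched with scikit-learn's `DummyClassifier`; the toy labels below mimic the dataset's imbalance:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with roughly the dataset's flavour of imbalance.
y = [0] * 97 + [1] * 3
X = [[0]] * 100  # features are irrelevant to a majority-class model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

# High accuracy, but F1 on the death class is zero by construction:
# the model never predicts a death.
print("Accuracy:", accuracy_score(y, pred))
print("F1 (death class):", f1_score(y, pred, zero_division=0))
```

Against this baseline, accuracy alone is meaningless; the death-class recall/F1 is what the trained models must beat.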

2. Predictive Modeling: Linear Regression.

In the cell below, we create a linear regression model and evaluate it using a number of metrics.

- (2.1) On the training set, train a linear regression model to predict the target feature, using only the  descriptive features selected in exercise (1) above.

In the function that generates the model, we create a Linear Regression model using only the features already listed. We train the model using the training set, completing this requirement.

- (2.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model by analysing each coefficient and how it relates each input feature to the target feature).    

In the function that generates the model, we print out the features and the equation which is used. We see:

death_yn =

('age_group_10 - 19 Years' * 0.002270128169016561)

+('age_group_20 - 29 Years' * 0.00021675244052644926)

+('age_group_30 - 39 Years' * -0.0006233744350718495)

+('age_group_40 - 49 Years' * -0.0008676843355245704)

+('age_group_50 - 59 Years' * 0.0025145985771263604)

+('age_group_60 - 69 Years' * 0.03328400774759359)

+('age_group_70 - 79 Years' * 0.09617783507054159)

+('age_group_80+ Years' * 0.2645860166988681)

+('age_group_Unknown' * 0.10438765773694733)

+('hosp_yn_OTH' * -3.3306690738754696e-16)

+('hosp_yn_Unknown' * 0.012189583487253167)

+('hosp_yn_Yes' * 0.1728663400937809)

+('icu_yn_Unknown' * 0.01778246632302322)

+('icu_yn_Yes' * 0.26395229124215197)

+('medcond_yn_Unknown' * -0.005134164961470882)

+('medcond_yn_Yes' * 0.021699890131505622) + (-0.02191179249340093)

From this, we observe that the features with the largest positive coefficients are the older age categories (70+), hospitalisation status, and ICU status. People with these elements flagged are more likely to be predicted as dying from COVID, as the model weighs these features heavily. As the threshold is 0.5 and the intercept is -0.02, and as the age, hospitalisation, ICU, and medical condition features are mutually exclusive within each group (in the sense that you can only fall into one age value, one ICU value, one hosp value, one med cond value), some combination of hospitalisation Yes, ICU Yes, and age 70+ would be required for the model to predict death, as otherwise the coefficients do not sum high enough to meet the 0.5 threshold.

As this is a linear regression model, each coefficient relates to the unit change in the predicted probability of the outcome given the presence of that feature. The intercept is a vertical shift of the line (i.e. shifting up or down) and corresponds to the baseline case.

With regards to discussing each of these features, it seems unnecessary and overly verbose to run through the weighting of each feature; the key aspects have been highlighted and the weight and importance is clear.

- (2.3) Print the predicted target feature value for the first 10 training examples. Threshold the predicted target feature value given by the linear regression model at 0.5, to get the predicted class for each example. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.


In the generation function, I have 'predicted' the training data that was used to fit the model:

In the generated function I have thresholded the value and printed the first ten results.

In the generated function I have computed evaluation metrics.

This is not a 'good' evaluation technique as the model has been trained on that same data; as such these results carry little actual weight or insight into how the model will work on new data. Similarly, Linear Regression is not meant to be used in this manner as it is not a classification model and is not really designed for features like this.
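The thresholding step can be sketched as follows, on toy data; note that the raw regression output is a continuous score that can even fall below zero, as seen in the printed table:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for the encoded training data: four binary features,
# with the target an AND of the first two (an illustrative assumption).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
y = ((X[:, 0] + X[:, 1]) >= 2).astype(int)

linreg = LinearRegression().fit(X, y)

# The raw output is a continuous score; thresholding at 0.5 turns it
# into a 0/1 class label, as required by (2.3).
scores = linreg.predict(X)
pred_class = (scores >= 0.5).astype(int)
print(scores[:3], pred_class[:3])
```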

At this stage, we see that the model has a high accuracy for predicting not death, however it also is over-enthusiastic in falsely classifying things into not death. While the results are slightly better than just flagging everybody as not death, the model ultimately at this stage appears to be quite poor. Particularly for COVID, we would prefer a model which is overly aggressive in classifying people as potential deaths to ensure those customers get priority treatment, rather than one which will falsely classify patients at significant risk as likely not to die.

These results are over the training set so not a reliable model to really look at, and these features will be examined more in the actual test data, however as an initial baseline we know that we are likely to receive a model which is accurate, but is accurate because it is poor at predicting death while the supermajority is not death leading to a high accuracy.

The results of this are listed below:


As required the First Ten Results predicting for the training data:

Actual Predicted PredictionClass Diff 8396 0 0.003143 0 0 987 0 0.005441 0 0 7274 0 -0.010131 0 0 1000 0 0.036210 0 0 4848 0 0.036210 0 0 9819 0 0.002058 0 0 7109 0 -0.009263 0 0 3123 0 -0.009887 0 0 5279 1 0.196887 0 1 7752 0 0.002303 0 0

----REPORT---- MAE: 0.03366751458925632 MSE: 0.03366751458925632 RMSE: 0.18348709651977252 R2: 0.003572408363848978 ----DETAIL----

Accuracy: 0.9663324854107437

Confusion matrix: [[6445 4] [ 221 13]]

Classification report: precision recall f1-score support

       0       0.97      1.00      0.98      6449
       1       0.76      0.06      0.10       234

accuracy                           0.97      6683

macro avg 0.87 0.53 0.54 6683 weighted avg 0.96 0.97 0.95 6683


- (2.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained on the training (70%) dataset. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated random train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.


In the generating function, I have evaluated the model on the test set and kept these as the main results. Compared to the evaluation on the training data, the key shift is notably worse predictive power for deaths. While the model retains high accuracy, this largely stems from it flagging far too many cases as not death. This is likely driven by the small number of positive death instances in our dataset, which skews heavily towards non-deaths. Splitting the full CDC dataset into samples of ten thousand is also likely having a significant impact in hampering the model's ability to correctly classify deaths. This suggests that, on new data, the model would perform poorly at identifying patients at significant risk from COVID.

In the generating function I have provided 5-fold cross-validation over the entire dataset. The cross-validated RMSE averaged 0.15688330546365153, versus approximately 0.18 when predicting on the training set.
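The cross-validation step above can be sketched as follows. This is a minimal illustration, assuming the one-hot features and 0/1 target are already in arrays `X` and `y`; here a synthetic stand-in with a similar class imbalance is used in place of the real CDC sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in data with roughly the same class imbalance as the CDC sample.
X, y = make_classification(n_samples=2000, weights=[0.965], random_state=42)

# 5-fold CV; sklearn returns negated MSE, so negate and take the square root.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-scores)
mean_rmse = rmse_per_fold.mean()
print("Mean 5-fold RMSE:", mean_rmse)
```

Averaging the per-fold RMSE like this gives a more stable estimate of out-of-sample error than a single train/test split.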

Based on our results, the macro-average F1 is approximately 0.59 on the test set, which suggests our prediction is only barely better than guessing, and only marginally better than the 0.49 obtained by flagging everybody as not death (particularly since that baseline's RMSE is similar, at around 0.18). This is slightly higher than on the training set, but we observe that the model precision on both classes has dropped.

Ultimately, based on the creation and analysis of our linear regression model, we can conclude that it is a poor model. While its accuracy is high, this is driven by the model heavily biasing towards flagging people as not death (incorrectly flagging people who did die as survivors), since non-deaths occupy the majority of our dataset. Linear regression is not designed for classification in this manner, so it should be expected that the results are not particularly strong. Ideally we should not use this model.

We explicitly compare the test set's actual versus predicted death labels to see the true positives for both classes.

3. Predictive Modeling: Logistic Regression.

- (3.1) On the training set, train a logistic regression model to predict the target feature, using the descriptive features selected in exercise (1) above.   

As with linear regression, this is done in the generating function.
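A minimal sketch of the fit, assuming the one-hot features selected in part (1) and the 0/1 `death_yn` target; synthetic stand-in data and the variable names (`X_train`, `logreg`) are my own, not from the brief.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the prepared CDC features/target.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

# Same 70/30 split with a fixed random state as in part (1.1).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Intercept:", logreg.intercept_)
print("Coefficients:", logreg.coef_)
```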

- (3.2) Print the coefficients learned by the model and discuss their role in the model (e.g., interpret the model).    

death_yn = logistic( intercept + Σ coefficient · feature ), with the learned coefficients:

    age_group_10 - 19 Years   -1.3954739647453331
    age_group_20 - 29 Years   -1.7477619209329482
    age_group_30 - 39 Years   -1.437896043371963
    age_group_40 - 49 Years   -0.5467412391146604
    age_group_50 - 59 Years    0.08372144064151235
    age_group_60 - 69 Years    1.385612326921969
    age_group_70 - 79 Years    2.1476304546024774
    age_group_80+ Years        3.2574857584006645
    age_group_Unknown          0.8325318957143447
    hosp_yn_OTH                0.0
    hosp_yn_Unknown            0.8415435273812718
    hosp_yn_Yes                2.3672678007865713
    icu_yn_Unknown             0.4915252721992774
    icu_yn_Yes                 1.9132227077415447
    medcond_yn_Unknown         0.3089176968434786
    medcond_yn_Yes             0.9468181138631563

where logistic(x) = 1/(1 + e^{-x}) is the standard logistic (sigmoid) function, not a log transform.

I.e. for F={(age features, age weighting), (hosp features, hosp weighting), (icu features, icu weighting), (med_cond features, med_cond weighting)}, the sum of weighted features goes inside a single sigmoid: $ \mathrm{P}(death\_yn=1|F)=\frac{1}{1+e^{-\left(-6.12674106+\sum\limits_{f=(f_1,f_2) \in F} f_1 f_2\right)}}$, where $-6.12674106$ is the learned intercept.

Each coefficient therefore represents the change in the log-odds of death for a one-unit change in the corresponding feature, so features with larger absolute coefficients shift the predicted probability more strongly. To this end, we see that again the older age groups are weighted heavily by the model, and again it is sensitive to ICU admission and hospitalisation. The intercept shifts the curve and dictates the base case (the log-odds when all features are zero).
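To make the interpretation concrete, the predicted probability can be recomputed by hand as the sigmoid of the linear score and checked against `predict_proba`. A sketch on toy data (the real CDC coefficients would slot in the same way):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score for the first row: intercept + sum(coefficient * feature).
z = model.intercept_[0] + X[0] @ model.coef_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))          # logistic (sigmoid) of the score
p_sklearn = model.predict_proba(X[:1])[0, 1]
print(p_manual, p_sklearn)                    # the two should agree
```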

- (3.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.

Again this is handled in the generating function: the first ten rows and the classification measures were printed.

Looking at the evaluation on the training set, we see that, compared to the linear regression model, the logistic regression model does a better job of predicting the death_yn feature. While more patients are flagged as potentially dying, the number of true positives determined over the training set is higher, and the macro average is significantly improved over the previous linear regression model. The model still under-flags deaths, which is a problem in a healthcare context where at-risk patients must be identified so they receive care, but it is an improvement on what a very simple regression model achieved.


As required, the first ten predictions for the training data:

              Actual  Predicted  PredictionClass  Diff
    8396           0          0                0     0
    987            0          0                0     0
    7274           0          0                0     0
    1000           0          0                0     0
    4848           0          0                0     0
    9819           0          0                0     0
    7109           0          0                0     0
    3123           0          0                0     0
    5279           1          0                0     1
    7752           0          0                0     0

    ----REPORT----
    MAE:  0.03321861439473291
    MSE:  0.03321861439473291
    RMSE: 0.1822597443066705
    R2:   0.01685810958566436
    ----DETAIL----

Accuracy: 0.966781385605267

    Confusion matrix:
    [[6398   51]
     [ 171   63]]

    Classification report:
                  precision    recall  f1-score   support

               0       0.97      0.99      0.98      6449
               1       0.55      0.27      0.36       234

        accuracy                           0.97      6683
       macro avg       0.76      0.63      0.67      6683
    weighted avg       0.96      0.97      0.96      6683


- (3.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and discuss your findings.


In the generating function I have printed these results. We can observe that this model performs more strongly than our simple linear regression model, achieving a macro F1-score of 0.70 on the test data. This is slightly stronger than what is achieved over the training set, and is a good sign that the model generalises and is not overfit to the training data. In fact, almost all metrics on the test data are higher than those observed on the training data. This would warrant further investigation to understand why, and to ensure that the comparatively strong results are a consequence of good generalisation. We see the RMSE is 0.175, while over 5-fold cross-validation it is 0.18, with the model being comparatively consistent. The model correctly identified 31 of the patients who had died of COVID, which is a much stronger result than in the linear regression example. Particularly given the healthcare context, it is much more important that the model correctly flags patients who will die or are at risk than that it correctly classifies healthy patients, so long as the false positive rate isn't so high as to overburden the healthcare system.

We explicitly compare the test set's actual versus predicted death labels to see the true positives for both classes.

4. Predictive Modeling: Random Forest

- (4.1) On the training set, train a random forest model to predict the target feature, using the descriptive features selected in exercise (1) above.  

This is done.
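For completeness, the fit can be sketched as below; the synthetic data and variable names are illustrative stand-ins for my prepared CDC features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the prepared CDC features/target.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```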

- (4.2) Can you interpret the random forest model? Discuss any knowledge you can gain in regard of the working of this model.    

As with the other models, the generating function has plotted the feature importances. We can see that the key features the model uses to weight its results are ICU admission, being in the over-80 age group, hospitalisation, being in the 70-79 age group, and having pre-existing medical conditions (being over 80 or hospitalised carries almost three times the weight of the next highest weighted feature).
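The importance ranking above can be extracted as below. The feature names and synthetic data are illustrative, standing in for my one-hot-encoded CDC columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names standing in for my one-hot CDC columns.
feature_names = ["icu_yn_Yes", "age_group_80+ Years", "hosp_yn_Yes",
                 "age_group_70 - 79 Years", "medcond_yn_Yes"]

X, y = make_classification(n_samples=800, n_features=5, n_informative=3,
                           random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# feature_importances_ sums to 1; sort to see the ranking.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

Note these are impurity-based importances, which can overstate high-cardinality features; on one-hot inputs like these they are a reasonable first look.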

- (4.3) Print the predicted target feature value for the first 10 training examples. Print the predicted class for the first 10 examples. Print a few classification evaluation measures computed on the full training set (e.g., Accuracy, Confusion matrix, Precision, Recall, F1) and discuss your findings so far.

This is done.

Based on the training data results, the macro-average F1 score is 0.72 with a high accuracy of 97%. This is the strongest-performing model examined so far over my sample data, though its performance over the training set is comparable to that of the logistic regression model. As with the other models, the biggest challenge is accurately classifying deaths, but this model has done better at that than all previous models on the training set.

- (4.4) Evaluate the model using classification evaluation measures on the hold-out (30% examples) test set. Compare these results with the evaluation results obtained when using the training (70%) dataset for evaluation. Also compare these results with a cross-validated model (i.e., a new model trained and evaluated using cross-validation on the full dataset). You can use classic k-fold cross-validation or repeated train/test (70/30) splits. Compare the cross-validation metrics to those obtained on the single train/test split and to the out-of-sample error and discuss your findings.


In the generating function I have printed these results. We can observe that this model performs more strongly than our simple linear regression model, our 'everybody lives' model, and our logistic regression model, achieving a macro F1-score of 0.71 on the test data, an out-of-bag (OOB) accuracy of 97%, and a 5-fold cross-validated accuracy of 98.3%. The test results are only slightly worse than the training results, and the difference is negligible, suggesting the model is well generalised and not overfit. The RMSE is lower than the logistic model's and the accuracy on deaths is slightly higher. Based on these results, I would recommend the RF model for use in a production setting.

5. Model Extension and Analysis

0. XGBoost

XGBoost is an extreme gradient boosting library, a descendant of boosting methods such as AdaBoost, widely used in industry (and in Kaggle-winning solutions) thanks to its fast training, powerful models, and large set of tunable parameters.

I start by building an XGBoost model which has the hyperparameters tuned via GridSearch to assess what an alternative model may look like and compare to what was built.
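The tuning setup can be sketched as below. To keep the snippet dependency-free I use sklearn's `GradientBoostingClassifier`; `xgboost.XGBClassifier` is a drop-in replacement with the same fit/predict interface. The grid values are illustrative, not the ones I finally selected.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in imbalanced data in place of the prepared CDC features.
X, y = make_classification(n_samples=600, weights=[0.9], random_state=42)

# Exhaustive search over a small illustrative grid, scored on macro-F1
# since accuracy is misleading on this imbalanced target.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="f1_macro", cv=3)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print("Best CV macro-F1:", grid.best_score_)
```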

1. Compare Models

- (5.1) Which model of the ones trained above performs better at predicting the target feature? 

- Is it more accurate than a simple model that always predicts the majority class (i.e., if 'no' is the majority class in your dataset, the simple model always predicts 'no' for the target feature)? Justify your answers.


In the cells below I have created some comparisons of each of the models built, and I have also compared the models against one another while describing their performance individually in the previous sections. Each of the trained models performs better than the simple 'everybody lives' model. Although the 'everybody lives' model may in some instances have higher accuracy, this is driven by the majority class being patients who live; by its nature it fails to predict any deaths. In the context of this assignment that is an incredibly poor property, as patients at risk of dying from COVID would not be appropriately flagged and could receive a poor outcome because doctors are unaware of the model's inner workings.
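The 'everybody lives' baseline can be made explicit with sklearn's `DummyClassifier` (a sketch on synthetic data with a similar class imbalance): it scores high accuracy on an imbalanced set while achieving zero recall on deaths.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Stand-in data with roughly the CDC sample's class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.965], random_state=42)

# Always predicts the majority class ('everybody lives').
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print("Accuracy:", baseline.score(X, y))                      # high
print("Death recall:", recall_score(y, baseline.predict(X)))  # zero
```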

The model with the best balance of true positives, false positives, and all other aspects is the Random Forest model, which slightly ekes out ahead of the logistic regression model. Overall, the XGBoost model, a non-mandatory part of the assignment, is the best performing model, though its performance is comparable to the Random Forest model (not too surprising, as XGBoost's default mode builds gradient-boosted trees). The one downside of the XGBoost model is that its training time is significantly longer than the other models', due to the GridSearch used to pinpoint optimal hyperparameters.

The graphs below show that over the test dataset, the performance of logistic regression, Random Forest, and XGBoost is quite comparable, with relatively minor differences between the three. I would suggest comparing against an AutoML system (e.g., Google's AutoML) as an additional model, as such systems incorporate ensembling and hyperparameter optimisation and are well regarded as easy out-of-the-box ML kits, but this is out of scope for the assignment.

Conclusion and Feature Addition

This has been partly covered in the previous section and is outlined in the problem-scope section in the intro.

We are trying to predict whether a patient is likely to have a good (living) or bad (dying) prognosis based on a combination of demographic details and their patient history by creating ML models.

There are two key challenges with this:

  1. We have very limited data in relation to both scope (only 10k data points, of which only a fraction are deaths, which is wildly insufficient for a decent model; scikit-learn's guidance suggests no fewer than 50k data points) and depth (no access to patient-level data [i.e., medical history, location] nor a patient identifier).
  2. In a healthcare setting, it is very important that we do not miss true positives: the risk to a patient of incorrectly classifying them as 'will not die' is very serious. As such, we need a model which minimises the number of missed deaths while also keeping the false positive rate in balance.
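One standard way to act on point 2 is to re-weight the minority class so the learner pays more for a missed death. A sketch using `class_weight='balanced'` on synthetic imbalanced data (this was not part of my submitted pipeline, just an illustration of the lever):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Stand-in imbalanced data in place of the CDC sample.
X, y = make_classification(n_samples=2000, weights=[0.96], random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X, y)

# Re-weighting typically raises recall on the rare 'death' class,
# at the cost of more false positives.
print("Plain recall on deaths:   ", recall_score(y, plain.predict(X)))
print("Weighted recall on deaths:", recall_score(y, weighted.predict(X)))
```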

To do this, we developed five models (simple majority-class, linear regression, logistic regression, Random Forest, and XGBoost) and compared their performance over both the training set and the test set, analysing the results for each model and paying particular attention to the number of deaths correctly identified and the overall precision.

Based on including ICU, Age, Medical Condition History, and Hospitalisation Status, we see that the Random Forest Model is the best performing model of the base models, and XGBoost is the overall strongest but was an additional model not mandatory for this assignment. The performance of Logistic Regression, Random Forest, and XGBoost were overall very comparable with minor performance differences between the three, while Linear Regression was poorly performant and the Simple Model, although it had a high accuracy, was totally inappropriate for the context of the problem.

Yes, we could create a gender-specific model. The gender distributions of death differ, so we could create one model for male patients and one for female or unknown-gender patients, then call whichever model is relevant to the patient's gender. Alternatively, we could add additional features or include an age split. We could also use XGBoost, as I have already done.
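The gender-split idea can be sketched as below, assuming a prepared DataFrame with a 'sex' column, one-hot feature columns, and a 0/1 'death_yn' target; the column names, the tiny illustrative frame, and the helper `train_per_gender` are my own stand-ins, not from the brief.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_per_gender(df, features, target="death_yn"):
    """Fit one RandomForest per value of the 'sex' column."""
    models = {}
    for sex, group in df.groupby("sex"):
        m = RandomForestClassifier(n_estimators=100, random_state=42)
        m.fit(group[features], group[target])
        models[sex] = m
    return models

# Tiny illustrative frame standing in for the cleaned CDC sample.
df = pd.DataFrame({"sex": ["Male", "Female"] * 50,
                   "icu_yn_Yes": ([0, 1] * 25) * 2,
                   "death_yn": ([0] * 45 + [1] * 5) * 2})
models = train_per_gender(df, ["icu_yn_Yes"])
print(sorted(models))
```

At prediction time, each patient's row is routed to the model matching their 'sex' value; the downside, as discussed below, is that each model sees only a fraction of the already-small data.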

Result:

Although the results above are split by gender rather than combined, I can clearly see on reviewing them that this model is less predictive than the original. This is likely because splitting the data has diluted it, resulting in poorer training of the model. As such, I no longer believe the split by age group which I had previously suggested would be useful.

Add additional features

We could try adding in additional features rather than just four. Although this is likely to cause dimensionality issues, we can see what happens.

Recap and Conclusion

As part of testing extensions of our model, we have tried:

  1. Developing a RF Model specific to Gender.
  2. The extension of our model to include all features.
  3. The creation of an XGBoost model

The first two potential improvements each resulted in an overall inferior model, except that extending the feature set for XGBoost gave similar performance.

Based on these results, and particularly the XGBoost model only attaining performance on par with the original model, I suspect the key requirement for further improvement will be gathering additional data or, as demonstrated with XGBoost, developing a new model.

Ideally, we would also gather more granular patient data to develop more 'hard-hitting' features such as history of pulmonary illness or cardiovascular risk indicators which are significant for COVID patients. Based on the performance of the XGBoost model, I believe there is only negligible room for improvement using these features outside of more sophisticated methods which are outside the scope of this module, and think the best chance for improvement will come from additional data being used to train the model.


Test on New Data

- (4.3) Take your best model trained and selected based on past data (ie your cleaned Homework1 dataset), and evaluate it on the new test dataset provided with this homework (in file '24032021-covid19-cdc-deathyn-recent-10k.csv'). Discuss your findings. 

First, I read in the new file and determine that it has data quality errors. As a result, I copy my Assignment 1 cleaning code into a function. Because I assume duplicates are required as part of the prediction, I do not drop duplicate rows (unlike in my Assignment 1 submission), but all other steps are the same. I then train a new Random Forest model over the entire historical data file and use it to predict each row of the newly cleansed file.

Note: I do not simply call predict with the model I already trained, as that was trained on the training split, i.e., 70% of the original file. Copying the existing function, setting the training data to be the full original data and the new file as the test data, was more convenient as it also includes the analysis. I create both an XGBoost model and an RF model (the best of the models we were required to build).
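The retrain-and-score step can be sketched as below, assuming the cleaned historical frame and cleaned new frame share the same feature columns and 'death_yn' target; the tiny frames, column names, and `FEATURES` list here are illustrative stand-ins, not the real files.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

FEATURES = ["icu_yn_Yes", "hosp_yn_Yes"]       # illustrative feature subset

# Stand-ins for the cleaned historical file and the cleaned new file.
hist = pd.DataFrame({"icu_yn_Yes": [0, 1] * 40,
                     "hosp_yn_Yes": [0, 0, 1, 1] * 20,
                     "death_yn": [0] * 70 + [1] * 10})
new = hist.sample(frac=0.5, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(hist[FEATURES], hist["death_yn"])       # train on ALL historical data
preds = rf.predict(new[FEATURES])              # score the new file
print(classification_report(new["death_yn"], preds, zero_division=0))
```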

Read in the new file.

Define a function that cleanses it identically to my Assignment 1 data

Get my new cleansed ADF data

Random Forest and XGBoost functions to work over the full Historical Dataset. New file names.

Manipulate old and new data with best model params.

Random Forest

XGBoost

Commentary and Conclusion on New Data

We see that both the XGBoost and RF models perform worse on the new dataset than in the original test results. Random Forest ended up slightly better in macro-average F1 score (although both were very close).

Based on the drop in sensitivity for predicting deaths on the new dataset, it is likely that time is an important factor in COVID outcomes. This makes sense, as the original data includes cases (and probable cases) from the beginning of the pandemic, when mortality was high due to a poor understanding of the disease and uncertainty in how to treat it. As time has passed, understanding of which factors are significant has improved.

Due to this time effect, it is likely important to refresh the models and re-examine the features used to build them, likely accounting for the stage of the pandemic in which the diagnosis was made. This would likely lead to higher accuracy over the new dataset.

Ultimately, while neither model is optimal by any means, both are useful as indicators and better than guessing. When productionising these models, it would be important to make very clear that the result is only an indicator, as the models still fail to classify every death case.

Personally, because of the ethical considerations, I would advise that the model is not deployed unless the death classification can be significantly improved either by collecting vastly more data or by getting a better dataset to work with (particularly involving patient history or regional data).